Introduction to R and Rstudio
Session - Cleaning data with {dplyr}
Artwork by @allison_horst
Wrangling
Is the reshaping or transforming of data into a format which is easier to work with
This is often the largest part of many analyses and data science
A note on tidy data
Tidyverse functions work best with tidy data:
Each variable forms a column.
Each observation forms a row.
(Broadly, this means long rather than wide tables)
{dplyr} package
{dplyr} is a language for data manipulation
Most wrangling puzzles can be solved with knowledge of just a few {dplyr} verbs or functions
Many of the concepts of these functions exist in SQL but {dplyr} (and other packages) can extend this further
6 functions/verbs to start with
6 key verbs will help us gain a deeper understanding of our data sets.
Note summarise() can also be spelt summarize()
dplyr:: arrange ()
dplyr:: filter ()
dplyr:: mutate ()
dplyr:: group_by ()
# followed often by
dplyr:: ungroup ()
dplyr:: summarise ()
Building with steps
These verbs aren’t used independently of each other.
Each can be a step in the code, like a recipe but can also be repeated.
A recipe starts with:
potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash
Recipe as code
The potato is the object in R terms and the steps are verbs or functions
Take a potato thenpeel thenslice into medium sized pieces thenboil for 25 minutes thenmash
potato |>peel() |>slice(size = "medium" |>boil(time = 25) |>mash()
the |> can be replaced with the word ‘then’ in this recipe scenario
Pipe
Shortcut key Ctrl+Shift+m
You might be familiar with the pipe |> which comes from the {magrittr} and is available through the {tidyverse} but the new pipe |> doesn’t require any packages to run
Q1. Which organisation provided the highest number of Mental Health (MH) beds?
arrange()
Reorder rows based on selected variable
beds_data |>
arrange (beds_av)
Descending data
We need descending order:
beds_data |>
arrange (desc (beds_av))
desc() works for text and numeric variables
Q2. Which 2 organisations provided the highest number of MH beds in September 2018?
We’ll use arrange() as before to get the “highest number”
But we require only observations with the date “September 2018”
filter()
The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.
beds_data |>
filter (date == "2018-09-01" )
A negative test of equality
To exclude and test where the expression is NOT equal !=
beds_data |>
filter (date != "2018-09-01" )
Ordered and filtered
filter() first to reduce the number of rows to apply the next code to
beds_data |>
filter (date == "2018-09-01" ) |>
arrange (desc (beds_av))
Find the top 2 organisations
This isn’t a key function but useful and there are many other functions for slice...
beds_data |>
arrange (desc (beds_av)) |>
filter (date == "2018-09-01" ) |>
slice_head (n = 2 )
Q3. Which organisations had the highest percentage bed occupancy in September 2018?
We’ll use arrange() as before to find “highest”
We’ll use filter() as before to restrict by date “September 2018”
But we don’t have a percentage variable in the data
Create new variables
= in this context is an alias not a test of equality
beds_data |>
mutate (perc_occ = occ_av / beds_av) |>
filter (date == "2018-09-01" ) |>
arrange (desc (perc_occ))
Q4. What was the mean number of beds (for the dataset)?
Let’s first look at how we’d produce summary statistics like a mean
And then see how this can be applied to groups of data like organisations
summarise()
Collapses a single summary value
beds_data |>
summarise (mean_beds = mean (beds_av))
Missing values
We’ll need to remove NA values to get a suitable mean. TRUE can also be T
beds_data |>
summarise (mean_beds = mean (beds_av,
na.rm = TRUE ))
Have a go!
Instead of mean() use median()
object |>
summarise (new_name = function_name (column_name,
na.rm = ???))
Use a sum() statistic twice
object |>
summarise (col_1 = function_name (beds_av,
na.rm = ???),
col_2 = function_name (occ_av,
na.rm = ???)
)
Answer for summary statistics
median()
beds_data |>
summarise (per_occ = median (beds_av,
na.rm = TRUE ))
# A tibble: 1 × 1
per_occ
<dbl>
1 241
sum()
beds_data |>
summarise (total_beds = sum (beds_av, na.rm = TRUE ),
total_occupacy = sum (occ_av, na.rm = TRUE ))
# A tibble: 1 × 2
total_beds total_occupacy
<dbl> <dbl>
1 412480 368434
Applying summarise() to groups
Now we know how to use summarise() (mean() and also sum() )
We’ll produce a summary value for each value of date
group_by() - persistent grouping
group_by() does nothing to the output alone.
The change occurs behind the scenes.
beds_data |>
group_by (date)
ungroup()
Seeing ungroup()
The data is unchanged but the metadata changes
beds_data |>
group_by (date)
beds_data |>
group_by (date) |>
ungroup ()
Break?
Option to take this break before an exercise of after
Q5. Which organisations have the highest mean % bed occupancy?
For each of the organisations group_by()
Add 2 summary statistics, total_beds/total_occupancy to summarise() the data
mutate() the data to create a percentage using the previous
Order to find highest by using arrange()
Hint
beds_data |>
group_by () |>
summarise () |>
mutate () |>
arrange ()
Solution
beds_data |>
group_by (org_name) |>
summarise (total_beds = sum (beds_av, na.rm = TRUE ),
total_occupancy = sum (occ_av, na.rm = TRUE )) |>
mutate (perc_occ = total_occupancy / total_beds) |>
arrange (desc (perc_occ))
End of session